Parkinson's disease Classification - Ensemble Techniques Project

Objective:

The goal of this binary classification task is to predict whether a patient has Parkinson's disease based on voice recording data, acting as an effective screening step that helps doctors decide whether to refer a patient for further diagnostics by a clinician.

Executive Summary:

  Parkinson's disease is a brain disorder that leads to shaking, stiffness, and difficulty with walking, balance, and coordination. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and the risk of dementia is increased. We therefore aim to use a patient's voice recording data to predict whether they have Parkinson's disease; this would act as an effective, non-invasive screening step for at-risk patients before a clinic visit is required for diagnosis.
The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. It is composed of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

We go through the machine learning pipeline, starting with reading the dataset and exploring the data through plots and summaries. Then, we preprocess the data, standardizing the features and checking for missing values. Later, we build models to classify the data.

Finally, we evaluate the best models using the whole test dataset.

Looking at the dataset from a data completeness perspective, it only has 195 rows. This will likely limit machine learning results: we would need more data for our model to generalize well to out-of-sample data and for our results or insights to be applicable in the real world. The values of the "name" variable are all unique and do not add any useful information, so we can remove this attribute for our modelling. Also, the numerical variables have different ranges, so we will have to scale the data before fitting models so that variables with larger scales do not bias the model.
The target column is status (categorical), and there is an imbalance in the classes: there are more positive examples (status = 1) than negative ones. So, we might need to employ upsampling techniques like SMOTE to balance the dataset. Overall, this is a binary classification problem, where the machine learning model will try to predict whether each row's status is 0 or 1.

Challenges:

Attribute Information:

We confirm that there are no missing values (NAs). Hence, we do not need to remove or impute missing values. If there were missing values, we would impute them, for example with column medians or with KNN imputation based on each data point's nearest neighbors.
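If imputation were needed, a minimal sketch with scikit-learn's SimpleImputer and KNNImputer might look like the following (on a toy matrix, since this dataset has no missing values):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy matrix with one missing value (illustrative only)
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

# Median imputation: replace NaN with the column median
median_imp = SimpleImputer(strategy="median")
X_median = median_imp.fit_transform(X)

# KNN imputation: replace NaN using the k nearest rows (NaN-aware distances)
knn_imp = KNNImputer(n_neighbors=2)
X_knn = knn_imp.fit_transform(X)

print(X_median[1, 1])  # median of [2.0, 6.0] -> 4.0
```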

From this point of view, the data looks great and there are no missing values.

We can see that the variables MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz), HNR, RPDE, DFA, spread1, D2, PPE all have unique values for each data point.

We can see that the variable MDVP:Jitter(Abs) seems to follow a non-standard skewed distribution with only 19 unique values.

Exploratory Data Analysis

Univariate Plots

As only 24.62% of records are from patients without Parkinson's disease, the target attribute is heavily imbalanced. So, we might need to employ techniques like upsampling, downsampling, or SMOTE so that the classifier is not biased toward the majority class.

Correlation heatmap

  • If there is multicollinearity, then we are unable to understand how one variable influences the target: there is no way to estimate the separate influence of each correlated variable on the target.
  • Spread1 and PPE are highly correlated.
  • We can see that attributes MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, NHR, and Jitter:DDP are highly correlated with each other
  • Also, attributes MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA, HNR  are highly correlated to each other.
  • Hence, we can replace these features with only three (one per group), since adding highly correlated variables will not improve model performance but will increase complexity.
  • So, from each of the three groups we pick the feature that has the strongest correlation with the target variable and ideally exhibits a normal or standard distribution.
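A sketch of how such highly correlated groups can be flagged programmatically, using illustrative column names and synthetic data rather than the actual dataset:

```python
import pandas as pd
import numpy as np

# Two near-duplicate features built from a shared signal, plus one independent one
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "jitter_pct": base + rng.normal(scale=0.05, size=200),
    "jitter_abs": base + rng.normal(scale=0.05, size=200),
    "hnr": rng.normal(size=200),
})

corr = df.corr()

# Flag pairs whose absolute correlation exceeds a threshold (0.9 here)
pairs = [(a, b) for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:] if abs(corr.loc[a, b]) > 0.9]
print(pairs)  # [('jitter_pct', 'jitter_abs')]
```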

spread1 and PPE both seem to be normally distributed. We can confirm this by looking at the describe output above, which shows the mean is almost equal to the median.
All the other highly correlated variables seem to have a non-standard, right-skewed distribution with some outliers lying to the right. Excluding D2 and spread2, the other variables seem to have a non-standard distribution.

spread1 is more highly correlated with the target variable than PPE. So, we select only the feature spread1 of the two for modeling.

MDVP:Jitter(Abs) is more highly correlated with the target variable than the other correlated variables. So, we select only the feature MDVP:Jitter(Abs) of this group for modeling.

MDVP:Shimmer is more highly correlated (albeit negatively) with the target variable than the other correlated variables. So, we select only the feature MDVP:Shimmer of this group for modeling.
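This per-group selection rule can be sketched as follows, with synthetic stand-ins for spread1, PPE, and status (spread1 is constructed to carry the stronger signal):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
status = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "spread1": status + rng.normal(scale=0.3, size=200),  # stronger signal
    "PPE": status + rng.normal(scale=0.8, size=200),      # weaker signal
    "status": status,
})

# Within a correlated group, keep the member with the largest
# absolute correlation with the target
group = ["spread1", "PPE"]
corr_with_target = df[group].corrwith(df["status"]).abs()
best = corr_with_target.idxmax()
print(best)  # spread1
```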

  • Distribution of attributes MDVP:Fo(Hz), RPDE, DFA, spread2, D2, PPE are fairly normal.
  • Attributes MDVP:Fhi(Hz), MDVP:Flo(Hz), MDVP:Shimmer, MDVP:Jitter(Abs) are right skewed with a few outliers.

Bivariate Plots with target

  • Patient PD status doesn't show much variation with MDVP:Fhi(Hz) and D2.
  • spread1 and spread2 have a strong effect on the target variable.
  • The other predictors show a weak relationship with the target variable.
  • Now, we can see that all the selected features are independent and some have a normal distribution.

Data Preprocessing

Over Sampling SMOTE
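In practice the SMOTE implementation from the imbalanced-learn package would be used here; as a self-contained illustration of the idea, the following is a minimal SMOTE-style sketch that synthesizes new minority-class points by interpolating between a minority sample and one of its nearest minority neighbors:

```python
import numpy as np

def smote_like_upsample(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE-style sketch: create synthetic minority samples by
    interpolating between each sample and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority class: four points at the corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_upsample(X_minority, n_new=4)
print(X_new.shape)  # (4, 2)
```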

Split Training and Testing Datasets
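A sketch of a stratified split on synthetic data of the same size as this dataset; the stratify argument keeps the class ratio equal across the two splits, which matters for an imbalanced target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative: 195 rows like the Parkinson's data, ~75% positive class
rng = np.random.default_rng(0)
X = rng.normal(size=(195, 5))
y = (rng.random(195) < 0.75).astype(int)

# Stratified, reproducible 70/30 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
```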

Scaling

  • Now, we can see that the scales and ranges of all the features are similar, with the mean approximately zero and the standard deviation approximately one. Hence, we can proceed to train the models on this dataset.
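The scaling step can be sketched as follows; note that the scaler is fit on the training split only and the same transform is then applied to the test split, to avoid leaking test statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative Hz-scale features with large mean and spread
rng = np.random.default_rng(0)
X_train = rng.normal(loc=150.0, scale=40.0, size=(100, 3))
X_test = rng.normal(loc=150.0, scale=40.0, size=(30, 3))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on training data only
X_test_s = scaler.transform(X_test)        # reuse the training statistics

print(np.allclose(X_train_s.mean(axis=0), 0.0),
      np.allclose(X_train_s.std(axis=0), 1.0))  # True True
```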

Model Building

Base Models

Logistic Regression

K-Nearest Neighbors Classifier

Naïve Bayes Classifier

Support Vector Classifier

Decision Tree Classifier
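Training and comparing the five base models above can be sketched in a single loop; the data here is a synthetic stand-in for the scaled Parkinson's features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic imbalanced data standing in for the real features
X, y = make_classification(n_samples=200, n_features=8,
                           weights=[0.25, 0.75], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "nb": GaussianNB(),
    "svc": SVC(),
    "tree": DecisionTreeClassifier(random_state=0),
}
# Fit each base model and record its F1 score on the held-out split
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(scores)
```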

Train a meta classifier

A meta classifier is simply a classifier that makes the final prediction by using the predictions of the base classifiers as its features. It takes the classes predicted by the various base classifiers and combines them to pick the final result.
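A minimal sketch of this idea with scikit-learn's StackingClassifier, using two illustrative base learners and a logistic regression meta classifier; the base learners produce out-of-fold predictions that the final estimator learns to combine:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Base learners feed their predictions to the final (meta) estimator
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()), ("svc", SVC())],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
print(stack.score(X, y) > 0.5)  # better than chance on the training data
```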


Ensemble Models

XGBoost Classifier (Boosting model)

Random Forest Classifier (Bagging model)

Hence, we can confirm that the most important features for predicting the target are spread1, MDVP:Fo(Hz), and spread2; MDVP:Shimmer, MDVP:Fhi(Hz), and MDVP:Jitter(Abs) are somewhat important; and D2, MDVP:Flo(Hz), DFA, and RPDE are weak predictors.
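A sketch of extracting such an importance ranking from a Random Forest, on synthetic data in which only the first three features are informative by construction:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the informative features in columns 0-2
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, sorted from strongest to weakest predictor
order = np.argsort(rf.feature_importances_)[::-1]
print(order[:3])  # the informative columns typically rank highest
```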


Model Evaluation

Random Forest Classifier is an ensemble model that employs multiple decision trees to classify, so it is less prone to overfitting and generally robust in classification tasks. XGBoost Classifier is a boosting ensemble model in which weak learners are trained sequentially, each model trying to reduce the error made by the previous one.
Random Forest Classifier performs well on this dataset as it is able to learn from similar data points: data points with similar attribute values tend to have a similar target response. XGBoost Classifier is likewise able to learn these patterns well.
However, one drawback of ensemble models is that they require more compute and training time, as well as increased inference latency. This may make them less viable in a production environment where prediction latency is critical.

We have chosen the F1 score as the metric to judge our models, since we are concerned with the positive class and the classes are imbalanced. The F1 score is a suitable choice because it takes both false positives and false negatives into consideration.
Also, looking at the confusion matrices for all models, we can see that the Meta Classifier and the Support Vector Classifier are the best models.
Further, the Meta Classifier is expected to generalize better than the Support Vector Classifier on unseen data, as it depends on many models and therefore has reduced variance.
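As a small worked example of this metric choice, the confusion matrix and F1 score on a toy set of predictions:

```python
from sklearn.metrics import f1_score, confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1]

# F1 balances precision (penalizing false positives) and
# recall (penalizing false negatives)
print(confusion_matrix(y_true, y_pred))  # [[1 2]
                                         #  [1 4]]
print(round(f1_score(y_true, y_pred), 3))  # 0.727
```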